Evaluating text-to-speech synthesizers

Walcir Cardoso¹, George Smith², and Cesar Garcia Fuentes³


Abstract. Text-To-Speech (TTS) synthesizers have piqued the interest of
researchers for their potential to enhance the L2 acquisition of writing (Kirstein,
2006), vocabulary and reading (Proctor, Dalton, & Grisham, 2007) and
pronunciation (Cardoso, Collins, & White, 2012; Soler-Urzua, 2011). Despite
their proven effectiveness, there is a need for up-to-date formal evaluations
of TTS systems. The present study was an attempt to evaluate the language
learning potential of an up-to-date TTS system at two levels: (1) speech quality
(comprehensibility, naturalness, accuracy, and intelligibility) and (2) focus on a
linguistic form (via a feature identification task). For Task 1, participants listened
to and rated human- and TTS-produced stories and sentences on a 6-point scale;
for Task 2, they listened to 16 human- and TTS-produced sentences to
identify the presence of a target feature (English regular past -ed). Results of
paired samples t-tests indicated that for speech quality, the human samples earned
higher ratings than the TTS samples. For the second task (past -ed perception),
the TTS and human-produced samples were equivalent. The discussion of the
findings will highlight how TTS can be used to complement and enhance the
teaching of L2 pronunciation and other linguistic skills both inside and outside
the classroom.

Keywords: computer-assisted language learning, CALL, text-to-speech, technology
and language learning.


1. Concordia University, Canada; walcir@education.concordia.ca 

2. University of Hawaii at Manoa, United States; gfsmith@hawaii.edu 

3. Concordia University, Canada; cesgarfu@hotmail.com

How to cite this article: Cardoso, W., Smith, G., & Garcia Fuentes, C. (2015). Evaluating text-to-speech
synthesizers. In F. Helm, L. Bradley, M. Guarda, & S. Thouesny (Eds), Critical CALL - Proceedings of the 2015
EUROCALL Conference, Padova, Italy (pp. 108-113). Dublin: Research-publishing.net.
http://dx.doi.org/10.14705/rpnet.2015.000318




1. Introduction 

The provision of target language input of sufficient quality and quantity is an
important issue in the field of second language acquisition. Three challenges
arise in providing it: (1) the need for vast amounts of comprehensible input
to develop language competence (Council of Europe, 2001; Krashen, 1985); (2) the
need for learner-centered and personalized input (Chapelle, 2001); and (3) the need
for exposure to a variety of speech models for robust phonological development
(Barcroft & Sommers, 2005).

Traditional face-to-face classroom settings may not be able to meet these criteria
due to the inherent restrictions of this teaching context (e.g. teacher-centered, one
variety of English used, lack of sustained input practice), especially in foreign-
language settings (Cardoso et al., 2012). One remedy to this problem lies in the
use of TTS systems, which can offset some of the limitations of traditional classrooms
given that they are highly flexible, learner-centered, and easily accessible. Several
studies have attested to the benefits of using TTS for learning writing (Kirstein,
2006), vocabulary and reading (Proctor et al., 2007) and pronunciation (Cardoso
et al., 2012; Soler-Urzua, 2011), both in and outside the classroom.

Despite these theoretical and empirical benefits, however, there exist very few
formal evaluations of TTS systems, specifically of their potential to promote
the ideal conditions under which Second Language Acquisition (SLA) is thought
to occur, a critical stage in the evaluation of Computer-Assisted Language
Learning (CALL) applications (Chapelle, 2001; Handley & Hamel, 2005). Those
evaluations which do exist have used a wide variety of rating methods, produced
mixed results (some demonstrating the adequacy of TTS, e.g. Kang, Kashiwagi,
Treviranus, & Kaburagi, 2009; others demonstrating inadequacy in some respects,
e.g. Handley, 2009; Nusbaum, Francis, & Henley, 1995) and date back more than 5
years. The present study was thus an attempt to provide an up-to-date evaluation
of a state-of-the-art TTS system concerning its potential to promote ideal SLA
processes. The following two evaluation criteria were chosen: (1) the speech
quality of the TTS system (input); and (2) the potential for learners to focus
on linguistic form in TTS- and human-produced input. Two research questions
were formulated as follows:

• What is the quality of speech produced by TTS systems in comparison with 
that by humans? 

• Can TTS systems provide learners with the opportunity to focus on form? 

2. Method

2.1. Participants and design 

Fifty-four university-level participants with a variety of L1 backgrounds were
recruited at an English-language university in Canada. Two tasks were designed to 
elicit participants’ perceptions of the TTS system: rating speech quality and feature 
identification. Both tasks had learners listen to speech samples produced by TTS 
and a human, with the goal being human-TTS equivalency; accordingly, a paired 
samples design was adopted for the analysis. 

2.2. Stimuli and materials 

A female speaker of North American English (Julie) in the program NaturalReader
13 (2013) was used as the TTS system and compared with a native speaker of the
same dialect with similar speech properties. For speech quality, participants listened
to two stories and twelve sentences and rated them on a 6-point scale according to
four judgment criteria: comprehensibility, naturalness, pronunciation accuracy, and
intelligibility. Potential for focus on form was measured by having learners
perform an aural feature identification task in which they judged whether certain
sentences contained a target grammar feature (English regular past -ed). Both
the stories and sentences were adapted from materials produced by the ALERT
research project (Collins et al., 2011). The tasks were administered by a trained
research assistant via Microsoft PowerPoint in a quiet lab at the university.

2.3. Analysis 

Data came from the participants’ judgments of the stories and sentences that they
heard and their accuracy in perceiving past -ed in decontextualized sentences (such
as “I hated the movie” and “I hate the movie”). The ratings of each participant were
tallied, and means were calculated for each story and sentence. Accuracy scores
were reported as raw scores, with a maximum of 8 points per speech source (i.e.
human or TTS). The main analysis was carried out by means of paired samples t-tests,
with an alpha level of .05 used for the determination of statistical significance.
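
As a rough illustration of the paired samples t-test named above, the sketch below computes the t statistic and degrees of freedom from per-participant differences using only the Python standard library. The function name `paired_t` and the six pairs of ratings are hypothetical and are not the study's data; the TTS-minus-human direction of the difference is also an assumption, chosen so that lower TTS ratings yield a negative t.

```python
import math
import statistics


def paired_t(sample_a, sample_b):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    if len(sample_a) != len(sample_b):
        raise ValueError("paired samples must have the same length")
    # One difference per participant: sample_a minus sample_b.
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical ratings on a 6-point scale from six listeners (NOT study data).
tts_ratings = [5, 4, 6, 4, 5, 5]
human_ratings = [6, 5, 6, 6, 5, 6]

t_value, df = paired_t(tts_ratings, human_ratings)
print(f"t({df}) = {t_value:.2f}")
```

Obtaining a p-value to compare against the .05 alpha level additionally requires the t distribution, which the standard library does not provide; a statistics package (for example, `scipy.stats.ttest_rel`) would typically be used for that step.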

3. Results 

Results for the rating task were as follows. For the stories, paired samples
t-tests revealed a significant difference in the rating scores on all categories
(comprehensibility, t(54) = -4.77, p < .001; naturalness, t(54) = -9.35, p < .001;
accuracy, t(54) = -7.32, p < .001; and intelligibility, t(54) = -6.40, p < .001).
Similarly, paired samples t-tests for the sentences also revealed significant
differences between the human- and TTS-produced samples for all measures
(comprehensibility, t(54) = -6.13, p < .001; naturalness, t(54) = -7.63, p < .001;
accuracy, t(54) = -7.34, p < .001; and intelligibility, t(54) = -6.11, p < .001).
For the past -ed identification task, paired samples t-tests revealed no significant
differences between the TTS- and human-produced speech samples
(t(54) = -1.93, p = .059). Table 1, Table 2, and Table 3
below show the descriptive statistics according to each task.


Table 1. Descriptive statistics for story rating

         Comprehensibility    Naturalness      Accuracy         Intelligibility
         Mean    SD           Mean    SD       Mean    SD       Mean    SD
Human    5.66    0.67         5.61    0.68     5.82    0.47     5.54    0.79
TTS      5.14    0.94         3.64    1.63     4.77    1.06     4.50    1.18


Table 2. Descriptive statistics for sentence rating

         Comprehensibility    Naturalness      Accuracy         Intelligibility
         Mean    SD           Mean    SD       Mean    SD       Mean    SD
Human    5.90    0.29         5.65    0.35     5.74    0.29     5.80    0.38
TTS      5.36    0.69         4.24    1.50     5.35    0.83     4.94    1.01


Table 3. Descriptive statistics for feature identification task

         Mean    SD
Human    6.03    1.38
TTS      5.59    1.12
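
To make the size of the human-TTS differences easier to compare across criteria, the short sketch below subtracts the TTS mean from the human mean for each rating criterion, using the means reported in Tables 1 and 2. The dictionary layout and the helper name `rating_gaps` are illustrative choices only. These raw gaps are simple descriptive differences on the 6-point scale, not effect sizes, and they do not replace the t-tests reported above.

```python
# Mean ratings copied from Tables 1 and 2 (6-point scale).
means = {
    "stories": {
        "human": {"comprehensibility": 5.66, "naturalness": 5.61,
                  "accuracy": 5.82, "intelligibility": 5.54},
        "tts": {"comprehensibility": 5.14, "naturalness": 3.64,
                "accuracy": 4.77, "intelligibility": 4.50},
    },
    "sentences": {
        "human": {"comprehensibility": 5.90, "naturalness": 5.65,
                  "accuracy": 5.74, "intelligibility": 5.80},
        "tts": {"comprehensibility": 5.36, "naturalness": 4.24,
                "accuracy": 5.35, "intelligibility": 4.94},
    },
}


def rating_gaps(task):
    """Return human-minus-TTS mean differences for one task, rounded to 2 places."""
    human = means[task]["human"]
    tts = means[task]["tts"]
    return {criterion: round(human[criterion] - tts[criterion], 2)
            for criterion in human}


for task in means:
    gaps = rating_gaps(task)
    largest = max(gaps, key=gaps.get)  # criterion with the widest gap
    print(task, gaps, "largest gap:", largest)
```

For both stories and sentences the widest gap falls on naturalness, which is consistent with the discussion's observation that naturalness is the one criterion on which the TTS ratings were not relatively high.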


4. Discussion and conclusions 

The present study sought to evaluate the speech quality and potential to focus
on linguistic form provided by a state-of-the-art TTS system. First, the results
revealed that the samples produced by the TTS system were rated significantly
lower than the human-produced samples for all four categories of speech quality
(comprehensibility, naturalness, pronunciation accuracy, intelligibility), at both
the story and sentence levels. This echoes previous findings that have shown
less favorable ratings for TTS-produced speech compared to human speech (e.g.
Handley, 2009; Handley & Hamel, 2005; Nusbaum et al., 1995). However, it is
important to observe that the mean rating scores assigned to the TTS system for
3 out of the 4 categories (naturalness excluded) were relatively high (4.5-5.3 out
of 6). Thus, the speech quality of this particular TTS system can be considered as
having achieved the “top rating(s)” needed for advancement to the next stage of
evaluation (i.e. the success of activities using TTS) and use in language learning in
general (Handley, 2009). The past -ed perception task yielded similarly
promising results. No statistically significant difference was found in participants’
ability to detect the presence of the target feature (past -ed), and accuracy was
high for both sources (~5.5 or 6 out of 8). This suggests that regardless of the
source of delivery (human or TTS), participants were equally able to perceive the
target form.

These results imply that modern TTS systems seem to be ready for
advancement to further stages of evaluation and, more importantly, for use in
language learning activities, particularly as a supplemental source of input which
can cater to learners’ individual needs and interests. Future research should not only
undertake evaluations of the success of TTS as a learning tool in classrooms (particularly
in English as a foreign language classrooms, where language exposure is limited),
but also continue evaluations for a variety of other factors, such as the level of
cognitive processing involved in listening to computer-generated speech.